tion from constructing tests with particular types of items deleted, and
(4)  exam  construction  or  processing  procedures  which  would  raise  test 
quality  for  both  Blacks  and  Whites. 

Item differentiation levels, calculated as the difference in item
difficulty between high and low scorers (D value) and also as the
item-total correlation (r_it), were found to be lower for Blacks than for
Whites, partly because item-difficulty levels were lower for Blacks. The
highest item-differentiation values had corresponding item-difficulty
levels which were easier than the median difficulty levels, indicating that
the use of easier items should contribute to better item differentiation
for both Blacks and Whites. Black-White score differences were reduced by
construction of new tests using items of similar difficulty, but test
quality was also reduced. Both item differentiation and test reliability
were improved by the construction of tests using easier items or more
highly correlated items, with slight and varied changes in score
differences. The "best" items initially selected by a sequential
procedure, applying an internal criterion, were not the same as those
selected by an external criterion.

An empirical validation of the present tests on subsequent job performance
for both Blacks and Whites was recommended, as was a validation and
comparison on internal and external criteria of the alternative test
construction procedures identified.
FOREWORD


This study was initiated in response to a request from the Chief of
Naval Personnel (Pers-6) to determine the feasibility of developing
Enlisted Advancement Exams from items similar in difficulty for both
Black and White racial groups, as an approach to improving equal
opportunity in career growth for minority groups. Previous studies
examined item-difficulty levels both for entire racial groups (Robertson &
Royle, 1976; TR 76-6) and for subgroups matched on total test score
(Robertson & Montague, 1976; TR 76-34). This report, the third in a
series, examines item differentiation and test reliability for the present
exams and for modified exams using alternative item selection procedures.

The substantial and valuable assistance of the following persons is
gratefully acknowledged: Mr. William E. Montague and DP2 Suzanne Olson,
for data processing and computation; and Ms. Hazel F. Schwab, for clerical
support.

This study was performed under Exploratory Development Task Area
ZF55.521.031 (Career Performance and Selection).


J.  J.  CLARKIN 
Commanding  Officer 
Problem

Blacks  are  advanced  to  paygrades  E-4  and  above  in  smaller  proportions 
than  Whites  and  score  lower  on  the  technical  knowledge  exam  than  do  Whites. 

It  has  been  found  that  when  exams  were  constructed  only  of  items  similar 
in  difficulty  for  both  Blacks  and  Whites  (to  reduce  total  test  score 
differences),  the  items  were  concentrated  in  the  difficult  (i.e.,  guessing) 
range.  This  prior  finding  suggested  that  such  an  approach  would  degrade 
test  quality. 

Purpose 

As a follow-on, the present study investigated test quality in terms
of item differentiation and test reliability. Questions specifically
addressed were: (1) what racial differences in item differentiation exist,
(2) what levels of item difficulty (P value) yield maximum item
differentiation for Blacks and Whites, (3) what impact constructing tests
by selecting particular types of items would have on item differentiation,
and (4) what exam construction or processing techniques would raise test
quality for Blacks and Whites.

Approach 

Item  response  data  for  exams  of  six  occupational  specialties  across 
four  pay  grades  (i.e.,  24  different  exams)  were  analyzed  as  follows: 

1. Racial differences in item differentiation were calculated as (a)
the difference in item difficulty between high and low scorers (D value)
and (b) the item-total score correlation (r_it value).

2. Levels of item difficulty yielding maximum item differentiation were
determined by comparing P values with corresponding D and r_it values.

3. Three types of modified tests were developed by selecting different
types of items: (a) items similar in difficulty for Blacks and Whites
(SIM-P), (b) those that were not extremely difficult (UPA-P), and (c) those
that were highly correlated (SEQUIN). Black-White score differences, item
differentiation, and test reliability values for these tests were compared
with those for the original test.

4. The SEQUIN item-selection procedure was applied to certain exams
using an on-job performance factor as a criterion. Items correlating highly
with internal (total score) and external (on-job performance) criteria were
compared.

Findings 


1. Item differentiation was generally lower for Blacks than for Whites,
partly because item-difficulty (P value) distributions are lower for Blacks
than for Whites (p. 7).
2. The highest item-differentiation values (D and r_it values) had
corresponding item-difficulty levels (P values) that were higher than the
median P values (of all items). This indicates that the use of easier items
should contribute to higher (i.e., better) item differentiation for both
Blacks and Whites (pp. 7 and 11).

3. Selecting items that were similar in difficulty for both Blacks
and Whites (SIM-P test) did reduce mean score differences between Blacks
and Whites, but it also reduced item differentiation and test reliability.
Selecting items that were easier for Blacks (UPA-P test) and those that
were highly correlated (SEQUIN test) resulted in slight and varied changes
in mean score differences and also increased item differentiation and
test reliability (p. 11).

4. The "best" items initially selected by the SEQUIN procedure by
applying an internal criterion were not the same as those selected by
applying an external criterion. This result raises new questions regarding
the relevance of internal-consistency type measures of test quality to
measures of subsequent job-relevant performance (p. 14).

Conclusions 


1.  Item  differentiation  and  test  reliability  of  advancement  exams 
could  be  improved  for  both  Blacks  and  Whites  by  using  item  selection  and 
construction  procedures  identified  in  this  study. 

2. Developing tests by using items similar in difficulty for Blacks
and Whites is not feasible since it reduces test quality. However,
developing tests by eliminating excessively difficult items would improve
test quality and benefit Blacks.

Recommendations 


The empirical validity of the present tests on subsequent job performance
should be compared between Blacks and Whites, and the alternative item
processing and construction procedures identified herein should be
validated and compared on internal and external criteria.
INTRODUCTION


Problem  and  Background 

The Enlisted Advancement System is one of the Navy's major personnel
selection systems being studied to identify and alleviate any condition
that might be detrimental to equal opportunity in career growth for all
individuals and groups. Advancements to paygrades E-4 and above are
competitive and are based on several differentially weighted factors,
including the score obtained on a technical knowledge exam, which is
substantially weighted. A separate exam, comprising 150 multiple-choice
items, is developed for each of approximately 80 Navy ratings (i.e.,
occupational specialties) and for each paygrade within each rating.

It has been found that Blacks score lower than Whites on the technical
knowledge exams, and that a smaller proportion of Blacks than Whites are
advanced. To reduce the difference in scores, Robertson and Royle (1975)
investigated the feasibility of constructing exams containing only items
that were similar in difficulty for both Blacks and Whites. They concluded
that the construction of such tests could not be recommended, since the
items of similar difficulty were concentrated in the difficult range (i.e.,
in the guessing range). Although they found that differences in average
total test score between Blacks and Whites would be reduced in tests
constructed of this type of item, they suggested that such tests would
degrade test quality for both groups. Thus, one aspect of the problem is to
find ways of constructing advancement tests that provide similar
competitive opportunity for all groups, but without loss of test quality,
as measured by item differentiation or internal consistency-type
reliability.

Purpose 


This study investigated racial differences in test quality in terms
of item differentiation,¹ including the effects of alternative item
selection techniques.

The  questions  specifically  addressed  were: 

1. What differences in item differentiation exist between Blacks
and Whites?

2. What P value levels yield maximum item differentiation for Blacks
and Whites?

3. What impact would constructing tests by selecting particular types
of items have on item differentiation and test reliability?

4. What exam construction or processing procedures would raise test
quality for Blacks and Whites?


¹The term "item differentiation" is used instead of the term typically
used  in  item-analytic  studies,  "item  discrimination,"  to  avoid  confusion 
in  the  context  of  racial  discrimination. 
METHOD


Data 


Item response data from the technical knowledge exams of the Series
61 (August 1972) advancement competitions were provided by the Naval
Examining Center (now the Naval Education and Training Program
Development Center, NETPDC).² The ratings selected for analysis were those
in which minority group representation was relatively high. The six ratings
selected, in competition to paygrades 4 through 7, were:

Aviation  Machinist’s  Mate  (Jet  Engine  Mechanic)  (ADJ) 

Boatswain's  Mate  (BM) 

Boiler Technician (BT)

Commissaryman  (CS) 

Hospital  Corpsman  (HM) 

Machinist's  Mate  (MM) 

Data (for Blacks and Whites only) for the 24 separate competing groups were
analyzed.  Table  1 presents  the  sample  size,  total  test  mean,  and  standard 
deviation  for  each  group. 

Analysis 

Racial  Differences  in  Item  Differentiation 


Item differentiation is considered more important than item
difficulty in constructing tests from "good" items; that is, those that
are neither extremely easy nor extremely difficult (e.g., P values between
40 and 80) and that relate to the total test score either by a high
positive correlation or by higher proportions of high than low scorers
answering the item correctly. P values of medium difficulty place upper
limits on the relationship of an item to total test score, but do not
guarantee effective item differentiation (Nunnally, 1967). The r_it and
D value statistics were applied to selected items of some of the exams to
examine racial differences in item differentiation. The r_it statistics
were obtained by calculating a Pearson product-moment correlation between
each individual's right-wrong response to an item and total test score,
yielding a point-biserial coefficient. The D value statistic was
calculated by rank-ordering total scores and splitting them at the median,
creating two subgroups: those who scored high on the total test and those
who scored low. D values were obtained by subtracting the percentage of
low scorers who answered the item correctly from the percentage of high
scorers who answered the item correctly, so that a positive D value means
the item was answered correctly more often by high scorers. Details of
these procedures and differences between them are discussed in the
Appendix.
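The two statistics described above can be illustrated with a minimal Python sketch. The function names and the toy data are hypothetical and are not taken from the study; the handling of an odd number of examinees (the middle examinee is dropped) and of ties at the median (left to the sort order) are assumptions made only for this illustration.

```python
from math import sqrt

def point_biserial(item, total):
    """Pearson correlation between 0/1 item responses and total scores (r_it)."""
    n = len(item)
    mean_i = sum(item) / n
    mean_t = sum(total) / n
    cov = sum((i - mean_i) * (t - mean_t) for i, t in zip(item, total))
    var_i = sum((i - mean_i) ** 2 for i in item)
    var_t = sum((t - mean_t) ** 2 for t in total)
    return cov / sqrt(var_i * var_t)

def d_value(item, total):
    """Percent correct among high scorers minus percent correct among low scorers."""
    order = sorted(range(len(total)), key=lambda k: total[k])
    half = len(order) // 2                    # assumption: drop the middle case if n is odd
    low, high = order[:half], order[-half:]
    pct_low = 100.0 * sum(item[k] for k in low) / half
    pct_high = 100.0 * sum(item[k] for k in high) / half
    return pct_high - pct_low

# Hypothetical responses of eight examinees to one item, with their total scores.
item = [0, 0, 1, 0, 1, 1, 1, 1]
total = [40, 45, 50, 55, 60, 65, 70, 75]

print(round(point_biserial(item, total), 3))  # r_it for the toy item
print(d_value(item, total))                   # D value in percentage points
```

In this toy case one of the four low scorers and all four high scorers answer correctly, so the D value is 75 percentage points.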


²This data set was also used in previous studies of this series
(i.e., Robertson & Royle, 1975, and Robertson & Montague, 1976).
Table 1

Advancement Exam Sample Sizes, Means,
and Standard Deviations by Race

Competition to
 Pay                     Black                       White
Grade  Rate         N     Mean      SD          N     Mean      SD

   4   ADJ3        47    52.38   12.60        644    69.96   14.75
       BM3         83    58.07    9.38       1033    64.15   11.86
       BT3         33    61.76   13.37        831    73.77   16.68
       CS3         27    67.59   10.15        447    76.12   11.76
       HM3        104    68.00   11.17       1429    73.45   15.53
       MM3         58    62.48   12.26       1259    72.44   16.56

   5   ADJ2        30    58.27   14.39        565    63.55   15.01
       BM2         74    60.12   11.70        569    63.43   10.56
       BT2         28    60.11   10.25        511    73.61   16.57
       CS2         47    64.00   11.41        412    69.01   10.66
       HM2        111    63.60    9.43       1391    70.27   13.40
       MM2         30    56.37   13.69        984    74.09   15.95

   6   ADJ1        50    67.78   15.56        400    72.31   15.19
       BM1        115    66.33   11.18        502    72.31   11.49
       BT1         79    70.44   13.57        495    80.70   17.18
       CS1        127    68.27   12.22        661    72.04   11.78
       HM1         26    68.58    6.87        546    71.32   11.08
       MM1         62    62.44   11.26        774    75.39   14.04

   7   ADJC        88    66.77   14.23       1014    70.07   14.50
       BMC        193    63.60   12.42       1103    65.75   10.87
       BTC        138    77.91   17.61        956    80.57   15.59
       CSC        165    63.01   14.24        771    65.58   13.92
       HMC        157    71.24   13.73       1817    70.75   10.02
       MMC        110    75.35   13.81       1547    78.73   13.63
Effects of Item-Difficulty (P Value) on Item Differentiation

Although P values of medium difficulty generally produce the most
differentiating items, the literature is not in full agreement as to what
the ideal P value or range of P values should be. Thus, to investigate
the relationship between item difficulty and item differentiation, D
values were rank ordered, seven-item sets were extracted from the top
ranks, and the corresponding P values for the D values were identified.
Similarly, P values were rank ordered; seven-item sets were extracted
from the top, middle, and bottom of the ranks; and the corresponding D
values were identified. Finally, r_it values were rank ordered, and the
P values for the highest and lowest nine r_it values were identified. All
of the above statistics were computed separately for Blacks and Whites
and then compared for racial differences.
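The first of these steps (rank the D values, take the top seven items, and find the median of their P values) can be sketched as follows. The data are the Black P and D values of the 20 ADJ3 items listed in Table 3; because the study ranked all 150 items, this subset only shows the mechanics and is not expected to reproduce the values reported in Table 4. The function name is hypothetical.

```python
from statistics import median

# (item number, Black P value, Black D value) for the 20 ADJ3 items of Table 3.
items = [
    (11, 21.28, 7.61), (12, 31.91, 19.93), (13, 42.55, 23.73),
    (14, 34.04, 41.12), (15, 34.04, 7.07), (16, 46.81, -1.99),
    (17, 46.81, 32.07), (18, 25.53, -1.09), (19, 27.66, 36.21),
    (20, 78.72, 34.48), (21, 21.28, 1.53), (22, 19.15, 50.00),
    (23, 34.04, 25.86), (24, 36.17, 40.42), (25, 38.30, 9.96),
    (26, 42.55, 57.09), (27, 29.79, 32.76), (28, 34.04, 16.86),
    (29, 34.04, -1.15), (30, 21.28, 10.54),
]

def median_p_of_top_d(rows, k=7):
    """Median P value of the k items with the highest D values."""
    top = sorted(rows, key=lambda row: row[2], reverse=True)[:k]
    return median(p for _, p, _ in top)

top_median_p = median_p_of_top_d(items)
all_median_p = median(p for _, p, _ in items)
print(top_median_p, all_median_p)
```

The same function applied to rows sorted on r_it instead of D would give the corresponding comparison for the correlation statistic.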

Effects  of  Item  Selection  Procedures 


To  compare  the  impacts  on  test  reliability  and  item  differentiation 
from  alternative  methods  of  item  selection,  the  following  three  types 
of  tests  were  simulated  and  comparative  statistics  computed: 

1. The similar P value (SIM-P) method, developed by Robertson
and Royle (1975), which selects only those items having a White P value
that is not significantly greater than the Black P value.

2. The upgraded P value (UPA-P) method, developed by Robertson
and Royle (1975), which selects only those items having a Black P value
greater than 25.

3. The SEQUIN method, developed by Moonan, Balaban, and Geyser
(1967), which sequentially identifies and selects items with high
correlations to maximize a least squares prediction of a criterion of
total score. This "heuristic" method selects items in an "accretion"
procedure. The first item selected is the one that correlates most highly
with a specified criterion; subsequent items selected are those whose
intercorrelations with the items already nominated tend to maximize the
correlation coefficient in a regression equation.
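The first two selection rules can be sketched as simple item filters. The text does not state which significance test Robertson and Royle used for the SIM-P rule, so the one-tailed two-proportion z test at the .05 level below is an assumption, as are the function names. The P values are those of five ADJ3 items from Table 3, with the ADJ3 sample sizes of 47 Blacks and 644 Whites.

```python
from math import sqrt

def white_p_significantly_greater(p_black, p_white, n_black, n_white, z_crit=1.645):
    """Assumed test: one-tailed two-proportion z test on percent-correct values."""
    pb, pw = p_black / 100.0, p_white / 100.0
    pooled = (pb * n_black + pw * n_white) / (n_black + n_white)
    se = sqrt(pooled * (1.0 - pooled) * (1.0 / n_black + 1.0 / n_white))
    return se > 0 and (pw - pb) / se > z_crit

def sim_p_items(p_values, n_black, n_white):
    """Keep items whose White P value is not significantly greater than the Black P value."""
    return [item for item, (pb, pw) in p_values.items()
            if not white_p_significantly_greater(pb, pw, n_black, n_white)]

def upa_p_items(p_values, cutoff=25.0):
    """Keep items whose Black P value exceeds the cutoff (25 percent correct)."""
    return [item for item, (pb, _) in p_values.items() if pb > cutoff]

# item number -> (Black P value, White P value), from Table 3
p_values = {11: (21.28, 34.32), 12: (31.91, 26.71), 14: (34.04, 73.45),
            20: (78.72, 72.98), 30: (21.28, 55.75)}

print(sim_p_items(p_values, 47, 644))   # items retained under the assumed SIM-P test
print(upa_p_items(p_values))            # items retained under the UPA-P cutoff
```

Under these assumptions, items 12 and 20 survive the SIM-P filter, and items 12, 14, and 20 survive the UPA-P filter.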

Internal consistency reliabilities (Kuder-Richardson type; Ghiselli,
1964, Formula 9-19) were recalculated for the new shortened tests and
compared with those of the original (ORIG) 150-item test. The obtained
values for the shortened tests were corrected by the Spearman-Brown formula
(Ghiselli, 1964, Formula 9-4) to provide comparisons of 150-item tests.
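The length correction can be illustrated with the standard Spearman-Brown prophecy formula, r' = kr / (1 + (k - 1)r), where k is the ratio of the target length to the shortened length. This sketch assumes that Ghiselli's Formula 9-4 is that standard form; the function name is hypothetical. The inputs are the obtained ADJ3 reliabilities and item counts reported in Table 6.

```python
def spearman_brown(r_obtained, n_items, target_items=150):
    """Project the reliability of an n_items test to a test of target_items items."""
    k = target_items / n_items
    return k * r_obtained / (1.0 + (k - 1.0) * r_obtained)

# ADJ3 shortened tests from Table 6: (obtained reliability, number of items)
print(round(spearman_brown(0.538, 74), 3))    # SIM-P
print(round(spearman_brown(0.830, 114), 3))   # UPA-P
print(round(spearman_brown(0.854, 125), 3))   # SEQUIN
```

With these inputs the projected values round to .702, .865, and .875, which agree with the corrected reliabilities shown for ADJ3 in Table 6.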

Means  and  standard  deviations  were  recalculated  separately  for 
Blacks  and  Whites  for  the  shortened  tests  and  compared  with  those  of  the 
original  test. 
Effects of Exam Construction and Processing Procedures


To  examine  alternative  test  construction  or  processing  procedures 
that  might  raise  test  quality,  a concurrent  measure  of  on-job  performance 
was  used.  Since  no  longitudinal  type  of  external  criterion  was  available 
for  the  present  analysis,  such  as  a measure  of  technical  job  performance 
at  the  next  higher  paygrade,  the  Performance  Factor  in  the  composite 
for  advancement  competition  was  utilized  for  illustrative  purposes. 

(Since  this  factor  is  a measure  of  present  rather  than  subsequent  job 
performance,  and  includes  evaluation  of  interpersonal  behaviors,  such 
as  leadership  and  conduct,  in  addition  to  technical  effectiveness,  its 
use  for  illustrative  purposes  only  is  emphasized.) 

The SEQUIN item-selection procedure was applied to the ADJ3 and BM2
Exams  with  the  Performance  Factor  as  a criterion.  Items  selected  early 
and  late  in  the  sequential  procedure  by  two  types  of  criteria — internal 
(total  score)  and  external  (on-job  performance) — were  then  compared  to 
determine  characteristics  of  valid  items  in  predicting  job  performance. 
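The idea of sequential ("accretion") selection against either an internal or an external criterion can be sketched as a greedy loop. This is not the SEQUIN algorithm of Moonan, Balaban, and Geyser, which works through a least squares regression; as a simplified stand-in, the sketch adds at each step the item that most raises the correlation between the unit-weighted sum of the selected items and the criterion. All names and data are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either variable has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def greedy_accretion(responses, criterion, n_select):
    """Select items one at a time to maximize r(sum of selected items, criterion)."""
    n_items = len(responses[0])
    selected = []
    composite = [0] * len(responses)
    while len(selected) < n_select:
        best_item, best_r = None, -2.0
        for j in range(n_items):
            if j in selected:
                continue
            trial = [c + row[j] for c, row in zip(composite, responses)]
            r = pearson(trial, criterion)
            if r > best_r:
                best_item, best_r = j, r
        selected.append(best_item)
        composite = [c + row[best_item] for c, row in zip(composite, responses)]
    return selected

# Hypothetical 0/1 responses of six examinees to three items.
responses = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1], [1, 1, 1]]
external = [1, 2, 3, 4, 5, 6]                  # hypothetical performance ratings
internal = [sum(row) for row in responses]     # total test score

print(greedy_accretion(responses, external, 2))
print(greedy_accretion(responses, internal, 2))
```

Running the same loop with the total score and with an outside rating as the criterion gives two selection orders that can be compared item by item, which is the kind of comparison described above.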
RESULTS


Racial  Differences  in  Item  Differentiation 

Black D values were found to be lower than White D values in 18 of the
24 rate groups (see median difference column of Table 2). A rank order
correlation between the median difference and Black sample size of -.42
indicates that the differences are partly attributable to the small Black
sample sizes (i.e., the largest differences tend to be associated with the
smallest Black samples).
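The rank order correlation can be recomputed from the Black sample sizes and the ranks of the median differences in Table 2. The sketch below assumes the classical Spearman formula, rho = 1 - 6(sum of d squared) / (n(n squared - 1)), with sample sizes ranked from smallest to largest and tied sizes given their average rank; the report does not state these details, and the function names are hypothetical.

```python
def average_ranks(values):
    """Rank ascending (1 = smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of rank positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(rank_x, rank_y):
    """Classical Spearman coefficient computed from rank differences."""
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Table 2, in row order ADJ3, BM3, ..., MMC: Black N and rank of median difference
# (rank 1 = largest positive Black-minus-White difference).
black_n = [47, 83, 33, 27, 104, 58, 30, 74, 28, 47, 111, 30,
           50, 115, 79, 127, 26, 62, 88, 193, 138, 165, 157, 110]
diff_rank = [15, 21, 23, 16, 22, 20, 13, 5, 18, 3, 24, 19,
             7, 4, 14, 9, 12, 17, 10, 1, 11, 6, 2, 8]

rho = spearman_rho(average_ranks(black_n), diff_rank)
print(round(rho, 2))
```

Under these assumptions the coefficient comes out at about -.42: groups with small Black samples tend to hold the high rank numbers, that is, the most negative Black-minus-White differences.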

Table 3 illustrates the racial differences in item differentiation
in terms of both D value and r_it differences for 20 items in the ADJ3
Exam. As shown, Black D values were more than 10 percentage points lower
than White D values on 8 items, while White D values were lower by that
margin on 4 items. (An inspection of all Black-White D value differences
revealed that, in 16 exams, Whites were the higher in a majority of those
items with differences of at least 10 percentage points; in 2 exams, Blacks
were the higher; and in the remaining 6 exams, the frequency with Blacks
higher and Whites higher was about equal.) On the ADJ3 Exam, employing the
r to Z transformation (Hays, 1963, Formula 15.26.6), Black and White r_it
values were significantly different for only 12 out of 150 items, which is
only 4 items more than would be expected by chance (about 8 of 150 at the
.05 level). Of these 12 items, Blacks were lower on 8.
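The r to Z comparison of two independent correlations can be sketched as below. The sketch assumes that the Hays formula is the usual Fisher transformation test, z = (Z1 - Z2) / sqrt(1/(n1 - 3) + 1/(n2 - 3)); the function name is hypothetical. The inputs are the Black and White r_it values for Item 14 of the ADJ3 Exam in Table 3, with sample sizes 47 and 644.

```python
from math import atanh, sqrt

def fisher_z_difference(r1, n1, r2, n2):
    """z statistic for the difference between two independent correlations."""
    se = sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (atanh(r1) - atanh(r2)) / se      # atanh(r) is Fisher's Z transformation

z = fisher_z_difference(0.574, 47, 0.315, 644)   # Item 14: Black vs. White r_it
print(round(z, 3))
print(abs(z) > 1.96)                             # two-tailed test at the .05 level
```

For Item 14 the statistic is about 2.10, in line with the starred entry for that item in Table 3.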

One possible reason for the lower Black item differentiation might be
the finding in the Robertson and Royle (1975) study that larger
proportions of Black than White P values are concentrated in or near the
guessing range (where item differentiation is poorest). The P values for
Item 30 in Table 3 tend to support this hypothesis, since the Black P value
is in the guessing range, but the P values for Item 16 do not.

Effects  of  Item-Difficulty  (P  Value)  on  Item  Differentiation 

Since P values of medium difficulty should yield the highest D values,
it is of interest to compare the corresponding P values of the highest
D values with the median P value of the total test (see Table 4). As
shown, the corresponding median P value of the highest D values is higher
than the total test median P value in 18 of the 24 rate groups for both
Blacks and Whites. (The six exceptions are: Black: CS3, BM2, ADJ1, MM1,
BTC, and HMC; and White: MM3, BT2, BT1, HM1, BTC, and HMC.) For example,
the corresponding median P value, 42.55, for the highest D values of the
ADJ3 Black group is substantially greater than the total test median
P value, 34.0, for that group.

Similar results were obtained from examining the corresponding P
values for high and low r_it values, and from reversing the orientation
and comparing high and low P values and their corresponding D values.
These results are presented in greater detail in the Appendix.
Table 2

Range and Median D Values

                   Blacks                            Whites                Median  Rank of
Rate      N        Range         Median     N        Range         Median  Diff.   Diff.

ADJ3     47    -8.18 to 64.73    22.83     644    5.46 to 54.47    24.82   -1.99     15
BM3      83    -9.23 to 48.25    17.86    1033    3.38 to 44.55    21.48   -3.62     21
BT3      33   -22.22 to 69.92    21.11     831    3.84 to 50.83    25.35   -4.24     23
CS3      27   -26.14 to 72.73    19.50     447   -0.08 to 50.05    21.50   -2.00     16
HM3     104   -13.47 to 48.90    19.51    1429    3.26 to 44.73    23.46   -3.95     22
MM3      58    -1.43 to 51.67    20.97    1259    0.35 to 50.87    24.58   -3.61     20
ADJ2     30   -24.43 to 75.00    22.62     565    5.89 to 44.28    24.09   -1.47     13
BM2      74   -11.01 to 48.96    21.64     569    1.07 to 39.65    21.11    0.53      5
BT2      28   -23.59 to 72.31    21.54     511   -6.25 to 49.55    24.67   -3.13     18
CS2      47   -23.09 to 63.45    20.05     412   -3.58 to 37.34    19.05    1.00      3
HM2     111   -17.89 to 47.89    16.91    1391   -0.03 to 44.30    22.02   -5.11     24
MM2      30   -19.64 to 60.00    21.72     984    3.03 to 50.03    24.86   -3.14     19
ADJ1     50   -17.90 to 59.03    25.76     400    2.14 to 52.23    26.06   -0.30      7
BM1     115    -4.48 to 45.61    22.12     502    2.36 to 36.97    21.14    0.98      4
BT1      79    -3.23 to 48.90    23.45     495    0.27 to 48.84    25.33   -1.88     14
CS1     127    -6.58 to 48.22    20.86     661    2.32 to 40.53    21.99   -1.13      9
HM1      26   -28.57 to 65.00    17.50     546    1.67 to 44.82    18.84   -1.34     12
MM1      62   -10.71 to 51.04    22.32     774   -0.75 to 45.44    24.33   -2.01     17
ADJC     88   -15.80 to 56.15    24.42    1014    0.79 to 54.24    25.61   -1.19     10
BMC     193    -2.13 to 44.97    22.54    1103   -1.21 to 39.94    20.57    1.97      1
BTC     138     2.18 to 53.61    23.43     956    1.90 to 42.13    24.63   -1.20     11
CSC     165     1.33 to 56.48    22.81     771   -8.92 to 50.74    22.45    0.36      6
HMC     157    -1.89 to 51.43    22.05    1817    0.47 to 46.30    20.66    1.39      2
MMC     110   -22.04 to 57.91    24.02    1547    0.22 to 43.14    24.82   -0.80      8

Note. Largest positive difference was assigned Rank 1.
Table 3

Racial Differences in Item Differentiation
for 20 Selected Items of the ADJ3 Exam

                 Black                        White             B Minus W Difference
Item
No.    P Value  D Value   r_it      P Value  D Value   r_it       D(a)    Z Test(b)

 11     21.28     7.61    .330       34.32    25.42    .287     -17.81      .865
 12     31.91    19.93   -.021       26.71    11.10    .036       8.83     -.359
 13     42.55    23.73    .346       58.54    20.88    .143       2.85     1.392
 14     34.04    41.12    .574       73.45    23.39    .315      17.73     2.101*
 15     34.04     7.07    .028       38.51     9.13    .013      -2.06      .096
 16     46.81    -1.99    .176       51.55    35.18    .306     -37.17     -.887
 17     46.81    32.07    .228       61.49    23.80    .241       8.27     -.089
 18     25.53    -1.09   -.026       29.97     8.45    .049      -9.54     -.481
 19     27.66    36.21    .126       19.41    17.22    .168      18.99     -.275
 20     78.72    34.48    .108       72.98    29.56    .243       4.92     -.896
 21     21.28     1.53    .063       23.76    16.68    .074     -15.15     -.071
 22     19.15    50.00    .041       36.65    36.58    .343      13.42    -2.031*
 23     34.04    25.86    .317       47.36    45.53    .387     -19.67     -.513
 24     36.17    40.42    .140       49.07    32.54    .356       7.88    -1.485
 25     38.30     9.96    .018       42.86    34.56    .340     -24.60    -2.157*
 26     42.55    57.09    .474       49.22    26.60    .272      30.49     1.516
 27     29.79    32.76    .387       61.49    26.00    .266       6.76      .871
 28     34.04    16.86    .115       44.88    31.50    .182     -14.64     -.440
 29     34.04    -1.15    .201       33.54    20.76    .211     -21.91     -.067
 30     21.28    10.54    .314       55.75    54.47    .501     -43.93    -1.449

(a) Differences greater than 25.00 are underlined.

(b) Significance of difference between two r_it correlations tested using
the r to Z transformation, Z = (Z1 - Z2) / σ(Z1 - Z2) (Hays, 1963,
Formula 15.26.6).

*Two-tail test, p < .05.
These results indicate that item differentiation would be improved
for both Blacks and Whites by the construction of tests using items that
are generally easier and, particularly, with less concentration of items
near the guessing range. The results tend to support those of Tinkelman
(1971), who proposed a P value of .75 as the optimum average item
difficulty for items with four options, because the error variance due to
chance tends to be greater when guessing occurs.

Effects  of  Item  Selection  Procedures 


Table 5 presents, for five rate groups, the effects on mean score,
P value, and D value from employing two types of tests: SIM-P and UPA-P.
(The median D value of the SIM-P test is probably an overestimate, and
that of the UPA-P test, an underestimate, because each is based on the
remaining D values, rather than rescoring section scores and recalculating
new D values.) Compared with the original operational tests (ORIG), it
was found that:

1. The SIM-P tests substantially reduced Black-White differences in
mean score and P value (e.g., for ADJ3 in Table 5, mean score differences
were reduced from 17.58 to 3.35; and P value differences, from 11.8 to 3.9)
in all five rate groups. However, median D values, as a measure of test
quality, were reduced in two of the five Black groups and four of the five
White groups (e.g., for HM2, the Black median D value remained at 16.9, but
that for Whites was reduced from 22.0 to 20.3).

2. The UPA-P tests produced slight and varied Black-White differences
in mean score and P value (e.g., for MM3 in Table 5, the mean score
difference changed from 9.96 to 9.86), but Black and White median D values
all increased (e.g., BM2 Black group, from 39.2 to 46.0).

Table 6 compares the SIM-P, UPA-P, and SEQUIN types of tests with the
original tests in regard to test reliability and Black-White mean
difference. The SIM-P tests reduced reliability substantially in some rate
groups (e.g., for ADJ3, in the corrected r_xx column for test length of 150
items, reliability decreased from .863 to .702), and slightly in others
(e.g., for BM2, from .729 to .726). The UPA-P and SEQUIN tests both
increased reliability slightly. Thus, SIM-P type tests reduced Black-White
differences in mean score but at a probably unacceptable cost in reduced
test quality for both Blacks and Whites. (The results of the present study,
using test quality measures of item differentiation and reliability,
provide empirical support for the conclusion of reduced test quality
reached in the Robertson and Royle (1975) study.) The effects of UPA-P
and SEQUIN tests on Black-White mean score differences are slight and
varied. Test quality (i.e., reliability) usually is increased slightly.
The increases are most likely slight because the reliabilities are already
quite high, usually in the high .80's. In the one exception, BM2, there is
a modest increase from the relatively low .729 to .764 (for UPA-P) and .769
(for SEQUIN).
Mean Total Score, Median P Value, and D Value
By Race on Three Types of Tests

                                  Black                     White             B Minus W Difference
Rate   Type                 Mean   Median  Median     Mean   Median  Median     Mean   Median  Median
Group  Test                 Total  P Value D Value    Total  P Value D Value    Total  P Value D Value

ADJ3   ORIG(a)              52.38    34.0    22.8     69.96    45.8    24.8    -17.58   -11.8    -2.0
       SIM-P(b,d)           55.16    36.2    20.9     58.51    40.1    21.1     -3.35    -3.9    -0.2
       UPA-P(c,d)           60.19    38.3    24.4     77.52    51.3    24.8    -17.33   -13.0    -0.4
       SIM-P minus ORIG      2.78     2.2    -1.9    -11.45    -5.7    -3.7      ...     ...     ...
       UPA-P minus ORIG      7.81     4.3     1.6      7.56     5.5     0        ...     ...     ...

HM3    ORIG(a)              68.00    45.2    19.5     73.45    49.6    23.5     -5.45    -4.4    -4.0
       SIM-P(b,d)           72.35    48.1    19.5     74.39    50.3    22.7     -2.04    -2.2    -3.2
       UPA-P(c,d)           76.10    49.0    20.3     81.34    53.1    24.6     -5.24    -4.1    -4.3
       SIM-P minus ORIG      4.35     2.9     0         .94      .7     -.8      ...     ...     ...
       UPA-P minus ORIG      8.10     3.8      .8      7.89     3.5     1.1      ...     ...     ...

MU     ORIG(a)              62.48    39.7    21.0     72.44    48.3    24.6     -9.96    -8.6    -3.6
       SIM-P(b,d)           66.59    41.4    20.1     68.99    44.6    22.7     -2.4     -3.2    -2.6
       UPA-P(c,d)           67.18    41.4    23.3     77.04    50.3    25.8     -9.86    -8.9    -2.5
       SIM-P minus ORIG      4.11     1.7     -.9     -3.45    -3.7    -1.9      ...     ...     ...
       UPA-P minus ORIG     [values missing]

BM2    ORIG(a)              60.12    39.2    21.6     63.43    43.2    21.1     -3.31    -3.9     0.5
       SIM-P(b,d)           61.03    41.2    21.6     61.34    41.5    21.1      -.31    -0.3     0.5
       UPA-P(c,d)           69.27    46.0    24.4     72.24    48.0    22.2     -2.97    -2.0     2.2
       SIM-P minus ORIG       .91     2.0     0       -2.09    -1.7     0        ...     ...     ...
       UPA-P minus ORIG      9.15     6.8     2.8      8.81     4.8     1.1      ...     ...     ...

HM2    ORIG(a)              63.60    41.4    16.9     70.27    46.3    22.0     -6.67    -4.9    -5.1
       SIM-P(b,d)           67.83    45.0    16.9     69.67    45.8    20.3     -1.84    -0.8    -3.4
       UPA-P(c,d)           70.84    45.1    19.0     76.52    49.9    22.3     -5.68    -4.8    -3.3
       SIM-P minus ORIG      4.23     3.6     0        -.6      -.5    -1.7      ...     ...     ...
       UPA-P minus ORIG      7.24     3.7     2.1      6.25     3.6     -.3      ...     ...     ...

(a) Includes the complete set of 150 items.
(b) Includes only items in which the Black P value was not significantly less than the White P value.
(c) Includes only items in which the Black P value was greater than .25.
(d) Mean total scores are simulated by obtained SIM-P or UPA-P score times
    (N items in original test) / (N items in simulated test).
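
The rescaling in footnote d can be sketched in a few lines. This is only an illustration: the function name is hypothetical, and the inputs are the ADJ3 Black SIM-P obtained mean (27.21 on 74 items) from the reliability table and the 150-item original test length.

```python
def simulated_total(obtained_mean: float, n_original: int, n_simulated: int) -> float:
    """Rescale a mean score on a shortened test to the original test length.

    Hypothetical helper illustrating footnote d:
    simulated mean = obtained mean * (N items original / N items simulated).
    """
    return obtained_mean * n_original / n_simulated


# ADJ3 Black group, SIM-P test: obtained mean 27.21 on 74 items.
adj3_black_sim_p = simulated_total(27.21, 150, 74)
print(round(adj3_black_sim_p, 2))  # 55.16, the tabled SIM-P mean total
```

The rescaling puts shortened tests on the same score scale as the original exam, so that mean totals can be compared directly across test types.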
Table 6

Reliability, Mean, and Standard Deviation of Four Types of Tests

Rate Group                           N       Reliability r_xx       Black            White        B-W
and N          Type                Items(a)  Obt.(b)  Cor.(c)    Mean     SD      Mean     SD     Dif.(d)
(Black/White)  Test

ADJ3           ORIG                  150      .863     .863     52.38   12.60    69.96   14.75    -1.285
(47/644)       SIM-P                  74      .538     .702     27.21    5.18    28.86    5.89    -0.298
               UPA-P                 114      .830     .865     45.75   11.49    58.92   11.78    -1.131
               SEQUIN                125      .854     .875     44.66   11.21    60.24   13.19    -1.277
               SIM-P minus ORIG                        -.161
               UPA-P minus ORIG                         .002
               SEQUIN minus ORIG                        .012

HM3            ORIG                  149      .870     .870     68.00   11.17    73.45   15.53    -0.408
(104/1429)     SIM-P                 115      .829     .863     55.85    8.47    57.41   11.85    -0.153
               UPA-P                 126      .867     .885     64.36   10.18    68.79   14.45    -0.359
               SEQUIN                125      .868     .887     59.69   10.69    64.55   14.48    -0.386
               SIM-P minus ORIG                        -.007
               UPA-P minus ORIG                         .015
               SEQUIN minus ORIG                        .017

MU             ORIG                  150      .884     .884     62.48   12.26    72.44   16.56    -0.691
(58/1259)      SIM-P                  95      .784     .851     42.17    7.48    43.69    9.78    -0.176
               UPA-P                 133      .878     .890     59.57   11.52    68.31   15.48    -0.647
               SEQUIN                125      .879     .897     53.29   10.75    62.29   15.07    -0.697
               SIM-P minus ORIG                        -.033
               UPA-P minus ORIG                         .006
               SEQUIN minus ORIG                        .013

BM2            ORIG                  150      .729     .729     60.12   11.70    63.43   10.56    -0.297
(74/569)       SIM-P                 126      .690     .726     51.28    9.98    51.55    9.08    -0.028
               UPA-P                 119      .720     .764     54.97   10.55    57.33    9.67    -0.253
               SEQUIN                125      .736     .769     53.23   11.11    56.64   10.02    -0.322
               SIM-P minus ORIG                        -.003
               UPA-P minus ORIG                         .035
               SEQUIN minus ORIG                        .040

HM2            ORIG                  149      .820     .820     63.60    9.43    70.27   13.40    -0.584
(111/1391)     SIM-P                 109      .710     .771     49.66    7.02    51.00    8.92    -0.168
               UPA-P                 125      .800     .827     59.43    8.57    64.57   11.84    -0.503
               SEQUIN                125      .814     .840     53.41    8.70    59.52   12.30    -0.581
               SIM-P minus ORIG                        -.049
               UPA-P minus ORIG                         .007
               SEQUIN minus ORIG                        .020

(a) Items remaining after deletion.
(b) Obtained (Obt.) value.
(c) Corrected (Cor.) value for a test of 150 items (Nunnally, 1967,
    Formula 7-6, p. 223).
(d) Difference: the mean difference in standard deviation units,
    calculated by (X_B - X_W) / [(SD_B + SD_W) / 2].

Calculated on Black and White groups combined.
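
The two derived columns of the table can be checked with a short sketch. It assumes that the cited Formula 7-6 is the Spearman-Brown formula for a lengthened test, and that the B-W difference divides the mean difference by the average of the two group standard deviations; both assumptions reproduce the tabled ADJ3 SIM-P values, but neither is stated explicitly in the text. Function names are hypothetical.

```python
def spearman_brown(r_obtained: float, n_obtained: int, n_target: int) -> float:
    """Project the reliability of a test lengthened from n_obtained to n_target items.

    Assumed form of the correction (Spearman-Brown):
    r_kk = k * r / (1 + (k - 1) * r), with k = n_target / n_obtained.
    """
    k = n_target / n_obtained
    return k * r_obtained / (1 + (k - 1) * r_obtained)


def bw_difference(mean_b: float, sd_b: float, mean_w: float, sd_w: float) -> float:
    """Black-minus-White mean difference in standard deviation units.

    Assumed divisor: the simple average of the two group standard deviations.
    """
    return (mean_b - mean_w) / ((sd_b + sd_w) / 2)


# ADJ3 SIM-P test: 74 items, obtained reliability .538.
corrected = spearman_brown(0.538, 74, 150)
difference = bw_difference(27.21, 5.18, 28.86, 5.89)
print(round(corrected, 3))   # 0.702
print(round(difference, 3))  # -0.298
```

Correcting every shortened test to a common 150-item length separates the effect of item quality from the effect of test length, which is why a subset such as SEQUIN can show a higher corrected reliability than the original exam.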
Effects of Exam Construction and Processing Procedures

When the Performance Factor was employed as representative of an
external, job-relevant criterion, the SEQUIN procedure reached a maximum
validity with a small subset of items. For the ADJ3 Exam, the value of
the validity coefficient rose rapidly to a maximum of .206 with the
selection of the 20 most valid items (see Figure 1a), then tapered off to a
slight negative validity of -.031 for all 150 items. Similarly, for the BM2
Exam, the validity coefficient reached a peak of .273 for 30 items, and
a final value of .016. Compared to the validity coefficient, the value
of the reliability coefficient, which is largely a function of the number
of items in a test, continued to rise steadily (see Figure 1b) during the
selection of the first 100 items and leveled off with the selection of the
"best" 120 items.

Since SEQUIN also identifies the specific items selected in the
"accretion" process, it was possible to categorize items according to
content and compare items selected early and late in the process. In the
selection of items from the ADJ3 Exam (see Table 7), twice the proportion of
theoretical items occurred in the last 25 (i.e., least valid) items as in
the first 25 (i.e., most valid), although this 16 percentage point
difference was not significant when a chi square test was applied.

Comparing the ADJ3 Exam items selected by both an internal and an
external criterion, items with the 14 lowest item-total correlations were
identified (r_it < .050). With the internal criterion, 11 of the 14 items
were among the last third of the items to be selected (see Table 8).
However, with the external criterion (the Performance Factor), 12 of the
14 items were in the first third of the items selected. Particularly,
three of the items with both a very low P value and r_it value were among
those selected earliest (fifth, seventh, and thirteenth) by the external
criterion.

Similar  results  were  obtained  on  the  BM2  Exam  (see  Table  9).  Twelve 
of  the  15  items  with  the  lowest  item-total  correlations  were  among  the 
first  third  of  items  selected  by  the  external  criterion,  with  six  of  those 
items  among  the  first  24. 
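
The accretion idea can be illustrated with a generic greedy forward selection. This is not the SEQUIN program itself, whose internals are not described here; it is a minimal sketch under the assumption that, at each step, the item added is the one that maximizes the correlation between the running score and the criterion. All names and the toy data are hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either variable has no variance."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def accrete(responses, criterion):
    """Greedy item accretion: returns [(item_index, validity_after_adding), ...].

    responses: one list of 0/1 item scores per examinee.
    criterion: one criterion value per examinee (an external rating, or the
    total test score when an internal criterion is wanted).
    """
    remaining = set(range(len(responses[0])))
    score = [0] * len(responses)
    history = []
    while remaining:
        best_item, best_r = None, None
        for j in sorted(remaining):
            trial = [s + row[j] for s, row in zip(score, responses)]
            r = pearson(trial, criterion)
            if best_r is None or r > best_r:
                best_item, best_r = j, r
        score = [s + row[best_item] for s, row in zip(score, responses)]
        remaining.remove(best_item)
        history.append((best_item, best_r))
    return history


# Toy data: item 0 tracks the criterion, item 1 is noise, item 2 runs against it.
responses = [[0, 1, 1], [0, 0, 1], [0, 1, 1], [1, 0, 0], [1, 1, 0], [1, 0, 0]]
criterion = [1, 2, 3, 4, 5, 6]
history = accrete(responses, criterion)
order = [item for item, _ in history]
```

In the toy data the validity peaks after the first item and falls as the weaker items are added, the same qualitative pattern reported for the ADJ3 and BM2 Exams. Passing the total score as `criterion` gives the internal-criterion ordering for comparison.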


[Figure 1: reliability and validity coefficients plotted against the number
of best items selected (1 to 150).]

Figure 1. Illustration of selection of most valid items by
SEQUIN (ADJ3 Exam).
Table 7

Proportions of Theoretical and Applied Type
Items in 25 Most and Least Valid Items
Selected by SEQUIN (ADJ3 Exam)

                      Items Selected by SEQUIN
Item Content       Most Valid     Least Valid     All Items
Category            25 Items       25 Items         (150)
                    N      %       N      %        N      %

Theoretical         4     16       8     32       32     21
Applied            20     80      16     64      110     73
Indeterminant       1      4       1      4        8      5

Note. For a 2 x 2 matrix of only those items identified
as theoretical or applied:

     4    8
    20   16

χ² = 1.0, .50 > p > .25.
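
The chi square value in the note can be reproduced from the 2 x 2 matrix. The sketch below assumes the continuity (Yates) correction was applied, since that yields exactly 1.0 while the uncorrected statistic would be about 1.78; the correction is an inference, not something the note states. Function names are hypothetical.

```python
import math


def yates_chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi square for the 2 x 2 table [[a, b], [c, d]] with Yates' correction."""
    n = a + b + c + d
    adjusted = max(abs(a * d - b * c) - n / 2, 0)
    return n * adjusted ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def p_value_one_df(chi_square: float) -> float:
    """Upper-tail probability of a chi square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2))


# Rows: theoretical, applied. Columns: most valid 25, least valid 25.
chi_square = yates_chi_square(4, 8, 20, 16)
p = p_value_one_df(chi_square)
print(chi_square)   # 1.0
print(round(p, 3))  # 0.317
```

The probability falls inside the reported range (.50 > p > .25), consistent with the statement that the 16 percentage point difference was not significant.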


Table 8

Comparison Between Internal and External Criteria
Of SEQUIN Item Accretion of Lowest Item-
Differentiation Values (ADJ3 Exam)

                                     Sequence in which Item was Selected by:
Item               Lowest r_it(a)    Internal Criterion    External Criterion
No.    P Value     (< .050)          (Total Score)         (Performance Factor)

120     .333         .028                  61                     46
105     .372         .043                  65                     62
 15     .382         .020                  92                     16
128     .423         .034                 103                     45
135     .425        -.009                 104                     17
118     .195         .041                 106                     39
125     .370         .047                 109                      4
 71     .124        -.022                 118                      7
 55     .534        -.043                 123                    117
 18     .297         .050                 128                     34
 45     .161        -.022                 136                     13
 97     .465        -.054                 142                     37
 12     .271         .023                 147                     33
113     .100        -.030                 150                      5

Note. N = 691 (47 Black and 644 White combined).

(a) Values are slight overestimates, since item is included in total
score.
Table 9

Comparison Between Internal and External Criteria
Of SEQUIN Item Accretion of Lowest Item-
Differentiation Values (BM2 Exam)

                                     Sequence in which Item was Selected by:
Item               Lowest r_it(a)    Internal Criterion    External Criterion
No.    P Value     (< .050)          (Total Score)         (Performance Factor)

 93     .661         .039                  52                     34
115     .295         .024                  59                     22
 60     .199         .003                  99                    107
  5     .591         .007                 100                     76
139     .212        -.019                 110                     43
 37     .292         .041                 115                     14
 98     .215        -.060                 132                     32
107     .104        -.037                 137                     31
 18     .267        -.019                 138                    136
131     .117        -.090                 143                      6
 81     .152         .031                 145                    115
 19     .093        -.020                 147                     24
130     .070         .025                 148                     15
123     .065        -.083                 149                     62
 73     .059        -.048                 150                     16

Note. N = 643 (74 Black and 569 White combined).

(a) Values are slight overestimates, since item is included in total
score.
DISCUSSION

Procedures for Improving Advancement Tests

The problem of how to improve enlisted advancement exams is discussed
in the light of the results reported above, the reality of the
administration and use of the tests, and the desirability of achieving one
or more of three objectives: (1) increasing test reliability, (2) increasing
test validity, and (3) decreasing Black-White score differences. It is, of
course, easier to state an objective than to achieve it. Even when the
rules of good item construction are followed, there is no assurance that
the item characteristics desired will be achieved, unless the items are
pretested. Nunnally (1967) suggests pretesting at least twice as many
items as are intended for the final test. Although such a procedure may
be ideal, there are practical limitations in regard to the development
of Navy enlisted advancement exams. Advancement is intensely competitive,
particularly in the higher paygrades where the proportion of openings is
much smaller than the proportion of highly qualified candidates available.
If items were pretested on a sample group, the examinees in the sample
group might have the advantage of being alerted to the specific content
of the forthcoming exam. Also, the P values would probably be lower in
the pretest than in the operational test, since the pretest examinees would
not be motivated to study as intensely as they would for the operational
test.

In lieu of a pretesting procedure, the tests could be improved by
the employment of four other procedures:

1.  Test  validation  on  an  external,  job-relevant  criterion. 

2.  Identification  of  the  most  and  least  valid  items,  and  a content 
categorization  of  the  items  identified. 

3.  Utilization of item construction procedures that tend to produce
items with the desired characteristics (e.g., having specified levels
of item difficulty, differentiation, and validity).

4.  Post  hoc  item  deletion  procedures  that  eliminate  undesirable  items 
after  administration  but  prior  to  final  scoring. 

Each  of  these  four  approaches  is  discussed  in  detail  below. 

Test  Validation 


The  primary  concern  with  a personnel  selection  test  is,  of  course, 
its  relevance  to  the  purpose  of  the  selection — in  the  present  case,  to 
the  individual's  effectiveness  in  the  next  higher  grade  for  which  selected. 
The  measures  of  test  quality  investigated  in  the  present  study — test 
reliability  and  item  differentiation — are  important  to  test  validity  (by 
setting  upper  limits  on  it)  but  do  not  of  themselves  assure  test  validity. 
Validation of the advancement exams on job-relevant criteria is
needed for two reasons. First, the courts are becoming increasingly
insistent on empirical evidence of the job relevance of personnel selection
procedures in compliance with the Civil Rights Act of 1964. Second, CNO
Objective Number CNO-1, entitled Retention of Career Personnel (of
September 1974), is not addressed to the retention of personnel in
general, but rather, to the retention of top quality career personnel.
The demonstration of top quality certainly is largely a function of an
individual's effectiveness on the job, and motivation to reenlist is
certainly heavily influenced by advancement success.

Highly effective validation procedures are available that would be
responsive to the above two requirements. The SEQUIN procedure, which was
demonstrated with an illustrative job-relevant criterion, was shown to be
quite useful, not only to maximize the validity of a test using a subset
of items but also to identify the specific items which contribute to, and
detract from, prediction of the criterion behavior.

Identification  and  Categorization  of  Valid  Items 

Since SEQUIN identifies the specific item selected in the "accretion"
process, it also provides test makers with the capability to analyze
and categorize the content of each item. With this knowledge, certain
"mixes" of various categories of items could be considered in the
construction of future tests. For example, there might be an optimal ratio
of theoretical to applied type items for maximum job-relevant validity. The
difference between proportions of theoretical and applied items in the
first and last 25 items selected in the ADJ3 Exam was not significant.
However, with larger pools of items (e.g., the first and last 50-item
subsets from a number of exams of similar occupational specialities),
significant differences might be identified. Also, categories other than
theoretical-applied might be studied, such as the differential validity
of the content of the subtest sections.

Item  Construction  Procedures 


In the reliability analysis of five rate groups (see Table 6),
the reliability of the BM2 Exam, .729, was substantially below that of
the other four groups. This result might be a function of either item
statistical or structural characteristics. For example, the median P
value (see Table 4) and D value (see Table 2) of the BM2 White group are
relatively low among all White groups. (Since the Black and White groups
of each rate group were combined to calculate the reliability, the obtained
value reflects primarily the distribution statistics of the majority White
group.)


Although the literature abounds with guidance for item writing,
many of the rules have not been adequately evaluated empirically. In one
empirical demonstration of undesirable item characteristics, Dudycha and
Carpenter (1973) found that:

1.  An inclusive distractor, such as "all (or any or none) of
the above" (as opposed to a specific distractor, which is a specified word
or phrase) reduces item differentiation.

2.  A negative stem structure, which includes the word "not"
(as opposed to a positive stem structure, which does not) increases item
difficulty.

3.  An open-stem structure, which requires the answer to complete
the sentence (as opposed to a closed-stem structure, which is a complete
sentence) increases item difficulty.

4.  The combination of open-positive stems and closed-negative
stems in the same test reduces item differentiation.

It was observed that all four of these item designs are used with varying
frequency in the present advancement exams, particularly in the BM2 Exam.
It would thus be useful to determine whether the use of these (and perhaps
other) structures contributes to undesirable item characteristics (e.g.,
reduced P values or D values).

Also, median P values and D values would probably be increased by
raising the criterion values for reuse of items (e.g., P values no less
than .30 or greater than .85, and r_it, with item in score, no less than
.05) but subject to item validity with an external criterion.
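
A reuse screen of this kind might be sketched as follows. The thresholds are the example values given above; the class, function, and the passing item (number 999) are hypothetical, while the first two items use the P and r_it values listed for ADJ3 items 120 and 18 in Table 8. The external-validity condition is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class ItemStats:
    number: int
    p_value: float  # proportion answering correctly
    r_it: float     # item-total correlation, item included in score


def eligible_for_reuse(item: ItemStats,
                       p_min: float = 0.30,
                       p_max: float = 0.85,
                       r_min: float = 0.05) -> bool:
    """True when the item meets the example difficulty and differentiation criteria."""
    return p_min <= item.p_value <= p_max and item.r_it >= r_min


items = [
    ItemStats(120, 0.333, 0.028),  # acceptable difficulty, r_it too low
    ItemStats(18, 0.297, 0.050),   # r_it at the floor, but too difficult
    ItemStats(999, 0.600, 0.300),  # hypothetical item that passes both checks
]
kept = [item.number for item in items if eligible_for_reuse(item)]
print(kept)  # [999]
```

Because low r_it items were among those selected earliest by the external criterion, a screen like this would need to be checked against external validity before any item is dropped.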

Post  Hoc  Item  Deletion  Procedures 


Although pretesting of items is probably not feasible, application
of item deletion procedures which eliminate undesirable items (e.g.,
those with extreme high or low P values, or low differentiation values)
subsequent to administration but prior to final scoring for selection
purposes might increase the reliability or validity of the exams. The
SEQUIN accretion procedures described above demonstrated that a subset
of items could be selected that yields a higher validity than, and an
equally high reliability as, the total set of items. However, these results
should be considered tentative, because the procedure capitalizes on the
intercorrelations of the sample data, and is thus influenced by chance.
Cross-validation is necessary to ensure that the results are not an effect
of sampling error (Henryssen, 1971).

The selection of items to increase reliability will usually tend
to increase validity (Henryssen, 1971). However, if excessive emphasis
is placed on increasing test homogeneity, the test may become too narrow
and one-sided in content to have high validity. In the SEQUIN
demonstration with the ADJ3 and BM2 Exams, many of the items with the
lowest item-total correlation were selected by an internal criterion near
the end of the accretion process, but by an external criterion near the
beginning.

A number of reasons might account for these results (other than
that the use of the present Performance Factor as an external criterion
may not have been appropriate, even for illustrative purposes). If the
test content tends to be heterogeneous, rather than homogeneous, as
suggested by some of the low intercorrelations among section scores, then
internal consistency type measures of reliability may be of limited
relevance. This possibility is suggested by a comparison between the
reliability and validity coefficients of the ADJ3 and BM2 Exams. Although
an internal consistency type measure of reliability places an upper limit
on the validity of a test, the situation only applies with homogeneous
tests. However, with a heterogeneous test, elimination of items with low
item-total correlations could result in the reduction of predictable
variance. It may be observed that the reliability of the BM2 Exam is
lower, but its validity is higher than those of the ADJ3 Exam. Also,
when the correlation between two tests is near zero or slightly negative
(as is the ADJ3 Exam with the external criterion), the items that correlate
lowest with total test score (i.e., the lowest r_it values) could very well
be those that correlate highest with an external criterion.

Balancing  Item  Biases 

Another issue pertains to the question of the compatibility of the
two objectives identified by the Chief of Naval Personnel to be
investigated: the feasibility of compiling "tests composed of questions
having identical or correlatable degree of difficulty (Rho) factors for both
Blacks and Whites." The Robertson and Royle (1975) study was addressed
to the first objective, "identical" difficulty; and the Robertson and
Montague (1976) study, to the second, "correlatable" difficulty. The
present study addressed both objectives in the context of item
differentiation and test reliability.

Both the Robertson and Royle (1975) and the present study found that the
construction of tests of items of similar difficulty, from the existing
pool of items, was not feasible. The question might be raised as to the
existence of, or the possibility of developing, items on which Blacks
are superior. If such items were found, tests might be constructed with
a "balance" of items in which Whites do well on some, and Blacks, on others.
Ironically, such tests would result in increased racial bias, as measured
by a decrease in relative item difficulty (Rho value). (The issue of
"balancing" item biases is discussed briefly by Cleary and Hilton (1968)
and by Jensen (1973).)

Implications  of  the  Results 

The demonstrations of improved item differentiation by eliminating
excessively difficult items and items with low or negative
differentiation suggest the need to implement the item-deletion and
item-construction procedures discussed. Such procedures would result in a
slight decrease in mean score differences between Blacks and Whites and, in
terms of test quality, a slight increase in item differentiation for Whites
and a moderate increase for Blacks. Also, any procedure that would raise the
level of P values would reasonably be expected to reduce the proportion
failed by the exam cut-score, thereby enabling those who passed to
continue to compete on their other advancement factors. Although such a
procedure was not demonstrated in the present study, it is of particular
interest and advantage to Blacks.
However, the SEQUIN demonstration, in which the items selected were
compared by internal and external criteria, also suggests that items deleted
to increase item differentiation or test homogeneity may be the types of
items that best contribute to predicting job-relevant performance by an
external criterion. Thus, until external validation studies are performed
to determine the relationship of test heterogeneity to subsequent
performance in the grade to which advanced, recommendations to implement the
procedures discussed above are deemed premature.


CONCLUSIONS 


1.  Enlisted Advancement Exam item differentiation and internal
consistency type test reliability could be improved for both Blacks and
Whites by using item selection and construction procedures identified,
developed, or demonstrated in this study.

2.  The development of tests in which only the items similar in difficulty
for both Blacks and Whites are used is not feasible because it would reduce
test quality. However, the elimination of excessively difficult items,
by either alternative item construction or post-administration item
deletion procedures, would improve test quality and, in particular, benefit
Blacks, because the proportion of candidates failed by the exam cut-score
would be reduced, thereby enabling those who passed to continue to
compete on their other advancement factors.

3.  The two objectives that were identified for investigation in the
present series of studies (the feasibility of compiling "tests composed of
questions having identical or correlatable degree of difficulty . . .
for both Blacks and Whites") may not be compatible. As stated above,
construction of tests of only items of "identical" difficulty, at least
from the existing pool of items, was not feasible. Using "balanced" items
might be an alternative to items of "identical" difficulty. However, even
if new items could be developed on which Blacks were superior, and tests
then constructed with a "balance" of items in which Whites do well on some
and Blacks on others, such tests would be characterized by a reduced
"correlatable" degree of difficulty. Thus, the use of a measure of relative
item difficulty as an indication of possible racial bias appears to be of
limited relevance in a study directed towards identifying effective
procedures to provide all racial groups with similar opportunities for
advancement.
RECOMMENDATIONS

The fundamental question regarding racial differences in advancement
should pertain to the relationship of each selection factor, including
the present Technical Knowledge Exam, to subsequent job-relevant performance
in the grade to which selected. The results of the final phase of the
present analysis raise important new questions regarding differences
between the "best" items selected by an internal and an external criterion.
Thus, implementation of the procedures discussed or demonstrated in the
present study (which was at the exploratory level of research), prior
to addressing these new questions, would be premature.

It is recommended that: (1) the empirical validity of the present
tests on subsequent performance be compared between Blacks and Whites,
and (2) the alternative item processing and item construction procedures
discussed in the present study be validated and compared on internal and
external criteria.
APPENDIX


METHODOLOGICAL ISSUES IN ITEM ANALYSIS

The calculation of item-difficulty and item-differentiation indices
for a large number of tests with large subject pools permitted
investigation of methodological questions as well as the study of racial
group differences.

A number of computational approaches may be used in determining
item-differentiation using the item-total relationship, including the r_it
and D value¹ techniques employed in this study. These and other alternative
procedures provide much the same information. The rankings of
item-differentiation values by alternative procedures usually yield
correlations among the ranks in the .90's (Nunnally, 1967). In computing
item-differentiation statistics, if the item itself is included in the
total (or section) score, some portion of the correlation value obtained
will be an artifact from the presence of the item itself (Nunnally, 1967).
(Obviously, the size of this artifact will vary inversely with the number
of items in the test/section.) Also, if a test contains subtests (i.e.,
"sections") of differing content (i.e., a nonhomogeneous type test), it
may be more appropriate to compare item responses with the subtest score
than with total score.

Alternative  Item  Analysis  Procedures  Employed 

To  investigate  the  effects  of  including  the  item  in  the  total  score 
and  of  computing  item-differentiation  statistics  on  sections  vice  total 
test  scores,  the  following  alternative  statistics  were  computed: 

1.  r_is (w/ item): item-section correlation, with the item included
in the section score.

2.  r_is (w/o item): item-section correlation, without the item
included in the section score.

3.  r_it (w/ item): item-total correlation, with the item included
in the total score.

4.  r_it (w/o item): item-total correlation, without the item
included in the total test score.


¹The D value of the present study is to be distinguished from the
Lawshe (1942) D value, adopted from the Kelley (1939) technique, which
expresses the difference between the two scoring groups in terms of
sigma units.
5.  D_s value (w/ item): percentage difference between high and low
section scorers who answered the item correctly.

6.  D_t value (w/ item): percentage difference between high and low
total scorers who answered the item correctly.
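
The with-item and without-item correlations in the list above can be sketched as follows. This is an illustration only: the function names and toy data are hypothetical, and the same function serves for r_is or r_it depending on whether the rows passed in cover one section or the whole test.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation; with a 0/1 item variable this is the point biserial."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def item_score_r(responses, item, include_item=True):
    """Correlate one item with the score, with or without the item in that score.

    responses: one list of 0/1 item scores per examinee, restricted to the
    items of a section (for r_is) or covering the whole test (for r_it).
    """
    item_scores = [row[item] for row in responses]
    scores = [sum(row) - (0 if include_item else row[item]) for row in responses]
    return pearson(item_scores, scores)


# Toy 4-item test taken by six examinees.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
r_with = item_score_r(responses, item=0, include_item=True)
r_without = item_score_r(responses, item=0, include_item=False)
print(round(r_with, 3), round(r_without, 3))  # 0.62 0.302
```

With only four items the artifact from including the item is large; on a long test the item is a small share of the score, so the artifact is correspondingly smaller.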

D_s values (hereafter referred to as D values) were calculated on all
items for all 24 rate groups employing the above procedure 5. Although
this procedure produces values that are overestimates from the presence
of the item itself in the section score, it was considered useful for the
present analysis, since the primary interest concerned the relative size
of the values between Blacks and Whites, rather than the absolute size
of the D value.
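
A D value of this kind might be computed as sketched below. How the high and low scoring groups were formed is not stated in this section, so the split into upper and lower halves (the `fraction` parameter) is an assumption made for illustration; the function name and data are likewise hypothetical.

```python
def d_value(item_correct, scores, fraction=0.5):
    """Percentage of high scorers minus percentage of low scorers answering correctly.

    item_correct: 0/1 per examinee for the item of interest.
    scores: section (or total) score per examinee, item included.
    fraction: share of examinees in each extreme group (assumed, not from the source).
    """
    n_group = int(len(scores) * fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    high, low = ranked[:n_group], ranked[-n_group:]

    def percent_correct(group):
        return 100.0 * sum(item_correct[i] for i in group) / len(group)

    return percent_correct(high) - percent_correct(low)


# Six examinees: item responses and section scores (item included).
item_correct = [1, 1, 0, 1, 0, 0]
section_scores = [10, 8, 7, 5, 4, 2]
d = d_value(item_correct, section_scores)
print(round(d, 1))  # 33.3
```

Passing a smaller `fraction` compares only the extreme groups, which is a common alternative convention for this kind of index.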

Intercorrelations  among  section  and  total  test  scores  were  calculated 
for  four  selected  rate  groups. 

Effects  on  Item-Test  Correlation  From  Including  Item  in  Score 

Table A-1 presents item-score point biserial correlations for all
four alternative responses for seven selected items of the HM3 Exam,
calculated both with and without the item included in the score. The
correlations between each alternative item response and test score were
found to be higher when the item was included in the score than when it
was not included. This finding is consistent with discussions in the
general literature (e.g., Nunnally, 1967). Inclusion of the item in the
section score frequently increases substantially the r_is of the correct
response alternative (e.g., for Item 2 alternative 3, from .211 to .424
for Blacks, and from .095 to .379 for Whites). Inclusion of the item
in total score, however, usually increases the r_it by only .02 to .04
correlation points (e.g., for Item 130 alternative 1, from .235 to .275
for Blacks, and from .188 to .219 for Whites). The increase in r_is, from
inclusion of the item in the section score, is greatest in the lowest
r_is values without the item (e.g., for Whites, from .055 to .215 in Item
150, compared with .391 to .449 in Item 30), although the difference in
r_it is slight (e.g., for Whites, from .003 to .046, a difference of .043
in Item 150, compared with a difference of .049 in Item 30).

In calculating D values, a similar procedure could have been applied
by dividing the group into high and low scorers for each item on the
basis of their score without that item included. This lengthy procedure
was not applied; therefore, all obtained D values can be considered to be
overestimates.
Comparison of Four Methods of Calculating Item-Score Correlations
On Seven Selected Items of the HM3 Exam

Total or Section score calculated:

w/  -- with item in the score
w/o -- without item in the score

Comparison of Item Differentiation by Section and Total Score

As expected, D_s values were found to be higher than D_t values. As
illustrated with 15 selected BM2 items in Table A-2, Black D_s values
exceeded D_t values by 4 to 41 percentage points with four exceptions
(e.g., in Item 10, the D_s value was lower by about 10 points). Also the
rank order of item differentiation varied considerably both by method
(D_s and D_t) and by race.


Table A-3 presents the item-score correlations, of the correct response
only, for 13 items (including the 7 items in Table A-1) from the HM3 Exam,
along with corresponding D values and P values.² The ranks (among the 13
items) of alternative item-differentiation values are quite similar across
method (e.g., r_is and D_s, r_is and r_it, etc.) when both methods include
the item in the score, and when both methods exclude the item. However,
the ranks vary when one method with the item included is compared with
another method with the item excluded. For example, on Item 110, the White
group ranks for r_is (rank 11) and D_s (rank 12), with the item in the score,
are nearly the same, compared to the r_is rank without the item (rank 6).
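The ranking used for these comparisons can be sketched as below. The values are the Black-group r_is correlations without the item in the score, as transcribed from Table A-3 (decimal points omitted). The tie rule, in which tied values share the mean of the ranks they occupy, is an assumption inferred from the rank of 3.5 shown for Items 20 and 30; the function name is hypothetical.

```python
def rank_descending(values):
    """Rank values so the highest gets rank 1; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + 1 + end + 1) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks

items = [1, 2, 3, 20, 30, 60, 70, 80, 90, 110, 130, 140, 150]
black_r_is_without = [139, 211, 239, 216, 216, 84, 9, 116, -24, 101, 248, -9, 118]
ranks = rank_descending(black_r_is_without)
rank_by_item = dict(zip(items, ranks))
```

Under this rule the computed ranks agree with those printed in the table for this column, for example rank 1 for Item 130 and rank 13 for Item 90.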

Of particular interest in Table A-3 is the comparison between r_is and
r_it values (without the item included in the score). If the total test
contains sections of differing content, use of r_is may be more appropriate
than r_it (as discussed on page A-1). Tables A-4 and A-5 present
intercorrelations among section and total scores for two exams. For example,
on the HM3 Exam (see Table A-4), section-section correlations range from
-.011 (sections 1 and 6) to .431 for Blacks, and from .019 to .648 for
Whites. Section-total correlations range from .363 to .814 for Blacks,
and from .370 to .904 for Whites. (The section-total correlations are
spuriously high, since the section is included in the total score.)
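The part-whole overlap mentioned in the parenthetical note can be illustrated by correlating a section score once with the total and once with the total less that section. This is a sketch with hypothetical section scores, not data from either exam, and the function names are illustrative.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def section_total_correlations(section_scores, section):
    """Return (r of section with total, r of section with total minus section).

    section_scores: one row per examinee, one score per section (hypothetical).
    """
    s = [row[section] for row in section_scores]
    total = [sum(row) for row in section_scores]
    remainder = [t - v for t, v in zip(total, s)]
    return pearson(s, total), pearson(s, remainder)

# Hypothetical scores: 5 examinees x 3 sections.
section_scores = [
    [4, 20, 10],
    [5, 22, 9],
    [3, 18, 11],
    [6, 25, 12],
    [2, 15, 8],
]
r_total, r_remainder = section_total_correlations(section_scores, section=0)
```

The correlation with the remainder is the lower of the two here, because the section no longer contributes to the score it is correlated with.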


²The measure of item-difficulty employed in this item-analysis was
the P value, the percentage of a group which answers the item correctly
(i.e., as defined by Tinkelman (1971, p. 62), the lower the P value, the
more difficult the item). This measure is to be distinguished from an
alternative measure of item-difficulty, Delta value, designated by the
Greek letter "Δ," and characterized by higher Δ values associated with
more difficult items. This latter measure employs "transformed criterion-
scores" of the persons attempting the item and is particularly appropriate
in tests measuring speed of performance (Conrad, 1948). Because both
Blacks and Whites tend to complete the entire test, the simpler P value
was used in the present analysis.
Table A-2


Comparison of Two Methods of Calculating
Item Differentiation of 15 Selected
Items of the BM2 Exam


          ----- On Section Score -----  ------ On Total Score ------
              Black         White           Black         White
 Item       D_s           D_s             D_t           D_t
  No.       Value  Rank   Value  Rank     Value  Rank   Value  Rank

   10       11.23   114   26.71    27     21.54    26   12.77    64
   20        9.53   119   22.70    58     15.60    52   14.96    45
   30       32.46    32   16.62   102     13.55    62    5.96   111
   40        7.89   125   23.50    53     11.28    75   19.18    19
   50       33.55    29   33.55     6      6.45   102   24.13     5
   60        9.67   118    9.21   135     -1.17   135   -1.78   143
   70       11.79   109   13.26   114     17.22    44   15.97    38
   80       11.31   113   17.40    93      4.76   112    2.58   131
   90        9.23   120   10.40   129      9.38    84    7.46    97
  100       15.31    96   15.31   105      9.89    82    3.51   126
  110       38.14    13   15.27   106     24.47    19   14.01    52
  120       12.43   105   21.57    71      7.62    97   11.09    74
  130       11.40   111    5.14   146      4.25   116    2.35   134
  140       40.77     9   27.68    20     -1.83   137   10.78    76
  150       28.72    44    8.48   137     10.77    79    2.49   133

Note. Highest D-value was assigned Rank 1.
Table A-3


Comparison of Four Item Statistics on
Selected Items of the HM3 Exam


Item No.         ---------------- Black ----------------  ---------------- White ----------------
(Section) Scoreᵃ r_isᵇ      r_itᵇ      D_sᵇ        P      r_isᵇ      r_itᵇ      D_sᵇ        P

1 (1)     w/     393 (3)    158 (8)    28.34 (4)   28.9   388 (3)    104 (9)    33.38 (3)   48.4
          w/o    139 (6)    118 (8)                       081 (8)    072 (9)
2 (1)     w/     424 (2)    160 (7)    30.58 (2)   19.2   379 (4)    111 (8)    30.73 (4)   31.4
          w/o    211 (5)    125 (7)                       095 (7)    082 (8)
3 (1)     w/     500 (1)    119 (10)   42.95 (1)   41.4   299 (7)    032 (13)   24.96 (5)   45.0
          w/o    239 (2)    075 (10)                      014 (13)   000 (13)
20 (2)    w/     310 (5)    289 (2)    15.71 (7)   49.0   486 (1)    437 (1)    43.44 (1)   59.2
          w/o    216 (3.5)  247 (2)                       427 (1)    411 (1)
30 (2)    w/     307 (6)    212 (6)    14.86 (10)  63.5   449 (2)    417 (2)    33.57 (2)   69.7
          w/o    216 (3.5)  170 (6)                       391 (2)    392 (2)
60 (3)    w/     294 (7)    073 (12)   19.55 (5)   56.7   239 (10)   041 (12)   19.40 (9)   52.9
          w/o    084 (10)   029 (12)                      040 (11)   008 (12)
70 (3)    w/     183 (12)   043 (13)   3.33 (13)   26.9   223 (12)   089 (11)   17.28 (11)  32.2
          w/o    009 (11)   003 (13)                      037 (12)   059 (11)
80 (4)    w/     257 (9)    314 (1)    9.91 (11)   35.6   332 (5)    271 (3)    23.41 (7)   30.9
          w/o    116 (8)    275 (1)                       233 (3)    243 (3)
90 (4)    w/     120 (13)   136 (9)    15.58 (8)   34.6   253 (9)    221 (4)    17.89 (10)  35.7
          w/o    -024 (13)  094 (9)                       147 (5)    192 (4)
110 (5)   w/     239 (10)   098 (11)   19.44 (6)   34.6   229 (11)   126 (7)    16.70 (12)  29.3
          w/o    101 (9)    056 (11)                      120 (6)    097 (7)
130 (5)   w/     376 (4)    275 (3)    28.98 (3)   33.7   313 (6)    219 (5)    24.01 (6)   42.4
          w/o    248 (1)    235 (3)                       198 (4)    188 (5)
140 (6)   w/     210 (11)   239 (4)    8.69 (12)   24.0   291 (8)    092 (10)   23.10 (8)   24.3
          w/o    -009 (12)  202 (5)                       069 (9)    064 (10)
150 (6)   w/     265 (8)    235 (5)    15.07 (9)   9.6    215 (13)   191 (6)    10.48 (13)  10.6
          w/o    118 (7)    209 (4)                       055 (10)   172 (6)

Note. Decimal points of r_is and r_it point biserial correlations have been
omitted.

ᵃTotal or section score calculated:

w/  -- with item in the score
w/o -- without item in the score

ᵇThe rank (among the 13 items only) of each value is indicated by the number
in parentheses, highest value with rank 1 (e.g., for Item 20, White r_is of
.427, calculated without the item in section score, is rank 1).
Table A-4


Distribution Statistics and Intercorrelations Among
Section and Total Scores of the HM3 Exam


                              Black
Section       1      2      3      4      5      6  Total

   1                308    001    122    183   -011    383
   2                       283    364    431    114    814
   3                              163    152    158    456
   4                                     419    248    684
   5                                            101    701
   6                                                   363

Mean       4.61  21.33   9.98  15.29  10.64   5.71  68.00
S.D.       1.73   4.54   2.33   3.32   3.41   1.96  11.17


                              White
Section       1      2      3      4      5      6  Total

   1                273    158    192    219    109    370
   2                       433    629    648    255    904
   3                              358    352    142    573
   4                                     542    213    797
   5                                            233    800
   6                                                   388

Mean       5.15  21.22  10.58  16.63  11.63   6.19  73.45
S.D.       1.60   5.29   2.50   4.37   4.07   1.92  15.53

Note. Decimal points for correlations have been omitted.


Table A-5


Distribution Statistics and Intercorrelations Among
Section and Total Scores of the BM2 Exam


                                     Black
Section       1      2      3      4      5      6      7      8  Total

   1                236    191    363    349    200    057    374    550
   2                       432    241    295    190    357    404    724
   3                              253    294    438    320    159    721
   4                                     322    234    259    284    582
   5                                            125    055    274    544
   6                                                   305    191    543
   7                                                          017    503
   8                                                                 527

Mean       7.47  13.58  12.35   5.39   5.64   5.32   5.39   4.97  60.12
S.D.       2.31   3.45   3.59   2.09   1.96   1.95   2.09   1.79  11.70


                                     White
Section       1      2      3      4      5      6      7      8  Total

   1                257    239    204    174    157    242    217    564
   2                       354    224    150    221    289    144    650
   3                              240    239    201    273    255    694
   4                                     152    082    236    163    515
   5                                            093    146    201    452
   6                                                   172    137    421
   7                                                          256    580
   8                                                                 500

Mean       8.26  13.71  12.63   5.78   5.78   5.83   6.07   5.31  63.43
S.D.       2.39   3.02   3.25   2.23   1.93   1.75   2.18   1.95  10.56

Note. Decimal points for correlations have been omitted.




It might be reasonable to assume that, if the section-total correlation
is low, r_is would be higher than and more appropriate than r_it (if
the section content is assumed to be homogeneous). However, these
assumptions are not supported by the few illustrative items of the HM3 Exam
in Table A-3. For example, for Blacks, r_it is higher than r_is on the two
items (140 and 150) from section 6, although this section had the lowest
section-total correlation (.363 in Table A-4). Of the two items (20 and 30
in Table A-3) from the section with the highest section-total correlation
(.814 in Table A-4), one r_it is higher, and the other is lower than r_is.

In the light of varying differences between r_is and r_it and among
section-total correlations (including, quite likely, even sections of
heterogeneous content), generally, the most useful measure of item
differentiation appears to be r_it (without the item included in total
score). (Nonetheless, use of D_s with item in section score is considered
useful and adequate for analyzing the relative differences between racial
groups in the present study.)

Relationship  Between  P and  D Values 

When the corresponding P values for the highest D values were examined
(see Table 4 and page 7), the median P value of the highest D values was
generally higher than the total median P value. Similar results were also
obtained with the corresponding P values for the highest r_it values in
Table A-6. With one exception (the MM3 Black group), these corresponding
P values are higher than the total test median P value. For example, the
corresponding median P value, 54.19, for the highest r_it values of the
ADJ3 White group, is substantially higher than the total test median P
value, 45.81, for that group.
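The comparison described above can be sketched on a small scale. The example below uses only the 13 White-group P and D_s values (item in the score) as transcribed from Table A-3, so it is an illustration on a handful of selected items, not a reproduction of the whole-test analysis in Table 4 or Table A-6. The function name and the choice of the three highest D values are assumptions made for the illustration.

```python
from statistics import median

def median_p_of_top_items(p_values, d_values, n_top):
    """Return (median P of the n_top items with the highest D, median P of all items)."""
    order = sorted(range(len(d_values)), key=lambda i: d_values[i], reverse=True)
    top_p = [p_values[i] for i in order[:n_top]]
    return median(top_p), median(p_values)

# White group, Items 1, 2, 3, 20, 30, 60, 70, 80, 90, 110, 130, 140, 150 (Table A-3).
white_p = [48.4, 31.4, 45.0, 59.2, 69.7, 52.9, 32.2, 30.9, 35.7, 29.3, 42.4, 24.3, 10.6]
white_d_s = [33.38, 30.73, 24.96, 43.44, 33.57, 19.40, 17.28, 23.41, 17.89, 16.70,
             24.01, 23.10, 10.48]

top_median_p, overall_median_p = median_p_of_top_items(white_p, white_d_s, n_top=3)
```

For these 13 items the three highest D_s values belong to Items 20, 30, and 1, whose median P value lies above the median P value of all 13 items, in the same direction as the whole-test result reported above.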

Table A-3 also provides examples of high P values which yield high or
low differentiation values (e.g., for the White group, Item 20 P value of
59.2 with r_it without the item in score of .411, but Item 60 P value of
52.9 with r_it of only .008), and low P values which yield high or low
differentiation values (e.g., Item 80 P value of 30.9 with r_it of .243,
but Item 70 P value of 32.2 with r_it of only .059).

Reversing the orientation and comparing P values with corresponding
D values yielded similar results (see Table A-7). The P values of middle
difficulty (e.g., ADJ3 Black group, median P value of 34.04) yield
corresponding D values (e.g., 41.12) which are substantially lower than the
highest D values (e.g., ADJ3 Black median, 54.41, in Table 4, page 10).
Figures A-1, A-2, and A-3 display the median P values and corresponding
D values for the 7-item ranked sets of items in Table A-7. It may be
observed that the highest P values yield corresponding D values which are
higher than the corresponding D values of the lowest P values for both
Blacks and Whites.
Figure A-3. Median P values and corresponding D values by race
(MM3 Exam).

Findings


In the methodological comparisons of alternative measures of item
difficulty, the item-score correlations with the item included in the
score were greater than those without the item included. With the item in
the score, the item-total correlation (r_it) was greater by about .02
to .04 correlation points, and the item-section correlation (r_is) was
greater by large and varying amounts.
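One way to see why the inflation is small for the total score and large for a section score is to consider an idealized case. The sketch below assumes k mutually uncorrelated items with equal variances, for which the correlation between any one item and the k-item sum that contains it is exactly 1/sqrt(k), while its correlation with the sum of the other items is zero. The score lengths used are hypothetical, since section lengths are not given in this appendix, and real items are neither uncorrelated nor of equal variance.

```python
from math import sqrt

def spurious_r(k):
    """Item-sum correlation due solely to the item's own presence in a k-item sum.

    Assumes k mutually uncorrelated items with equal variances (idealized case).
    """
    return 1 / sqrt(k)

# Hypothetical score lengths: two short sections and a long total test.
inflation = {k: round(spurious_r(k), 3) for k in (5, 20, 150)}
```

Under these assumptions the overlap contributes far more to a correlation with a short section score than to a correlation with a long total score, and the amount varies with section length.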

In comparisons of item-section and item-total measures, the percentage
difference between high and low scorers answering the item correctly was
higher on the item-section percentages (D_s values) than on the item-total
percentages (D_t values) with the item included in both scores.

Item-section (r_is) and item-total (r_it) correlations, without the
item included in the score of either, varied as to which was the larger.
Section score intercorrelations within each total test varied from low to
high values, suggesting some heterogeneity in some tests or some sections
of tests. (Heterogeneity would tend to reduce r_is or r_it values.) In
light of these varying differences, the most useful measure of item
differentiation appears to be r_it without the item included in the total
score.

The P values which corresponded to the highest D values or r_it values
were higher than the median P values for the total tests, suggesting that
easier items might improve item differentiation. In the comparison of the
ends of the P value ranges, the highest P values (i.e., easiest items) had
corresponding D values which were higher than the corresponding D values
of the lowest P values, which suggests that the difficult items are
excessively difficult.